Skip to content

[CORE] Refactor task error logging into dedicated utility class#11718

Merged
zhztheplayer merged 2 commits intoapache:mainfrom
pratham76:pm-refactor-logging
Mar 10, 2026
Merged

[CORE] Refactor task error logging into dedicated utility class#11718
zhztheplayer merged 2 commits intoapache:mainfrom
pratham76:pm-refactor-logging

Conversation

@pratham76
Copy link
Contributor

@pratham76 pratham76 commented Mar 8, 2026

What changes are proposed in this pull request?

Extract task error logging from TaskResources into new TaskErrorLogger utility, resolving a TODO here. Improves separation of concerns while maintaining identical functionality for filtering speculative execution errors.

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE works for Gluten Core label Mar 8, 2026
@github-actions
Copy link

github-actions bot commented Mar 8, 2026

Run Gluten Clickhouse CI on x86

@pratham76 pratham76 force-pushed the pm-refactor-logging branch from af1dfe8 to 2764e11 Compare March 8, 2026 07:27
@github-actions
Copy link

github-actions bot commented Mar 8, 2026

Run Gluten Clickhouse CI on x86

@pratham76
Copy link
Contributor Author

@zhztheplayer Could you please review?

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pratham76, thanks for helping out.

The intention of the TODO could mean we need to rely on Spark's scheduler layer to log the error. But we had to keep the TaskFailureListener here because Gluten may cause crash in CompletionListener so Spark's scheduler code will never have a chance to log the error.

So I think the discussion here is not about how to log the error. We may need to consider open PR in upstream Spark to make logging work even CompletionListener crashed.

I might not recall everything so let me know if you have different findings here.

Copy link
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the above comment, PR change itself LGTM.

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@pratham76
Copy link
Contributor Author

pratham76 commented Mar 10, 2026

Hi @zhztheplayer, Thanks for the clarification, I do get that the proper solution for the above mentioned TODO would be make spark's error logging more resilient when CompletionListener crashes.

In this PR, I have tried to refactor the logging logic into separate utility class, to handle both scenarios of task failures and task recompute/retries, just to make the code a bit cleaner. I'm happy to explore the upstream Spark changes for this TODO (which I have updated here to give better context). For now, could we proceed with this refactoring as an incremental improvement?

@pratham76 pratham76 force-pushed the pm-refactor-logging branch from 693fd91 to 9ef29f5 Compare March 10, 2026 02:48
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@zhztheplayer
Copy link
Member

Hi @zhztheplayer, Thanks for the clarification, I do get that the proper solution for the above mentioned TODO would be make spark's error logging more resilient when CompletionListener crashes.

In this PR, I have tried to refactor the logging logic into separate utility class, to handle both scenarios of task failures and task recompute/retries, just to make the code a bit cleaner. I'm happy to explore the upstream Spark changes for this TODO (which I have updated here to give better context). For now, could we proceed with this refactoring as an incremental improvement?

Sounds reasonable. Thanks for the patch.

@zhztheplayer zhztheplayer merged commit 205e133 into apache:main Mar 10, 2026
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants